🧬 Breast Cancer Prediction Dataset (marshalpatel3558 on Kaggle)¶

  • Dataset purpose: Supervised binary classification to predict whether a patient is alive or dead based on clinical and pathological features.

  • Data origin:
    Released under MIT License
    Copyright (c) 2013 Mark Otto.
    Copyright (c) 2017 Andrew Fong.

  • Number of samples (rows): 4,024 patient records.


🔍 Features (columns)¶

Column Name Description Example Values
Age Age of the patient at diagnosis (years) 68, 50, 58
Race Ethnicity of the patient White, Black, Other
Marital Status Marital status at time of diagnosis Married, Single, Divorced
T Stage Tumor size/stage (TNM classification) T1, T2, T3, T4
N Stage Lymph node involvement (TNM classification) N1, N2, N3
6th Stage Cancer stage (AJCC 6th edition) IIA, IIB, IIIA, IIIC, IIIB
differentiate Tumor cell differentiation Well differentiated, Poorly differentiated
Grade Histological grade of the tumor (numeric) 1, 2, 3
A Stage Anatomical stage of cancer Regional, Distant
Tumor Size Size of the tumor (mm or cm) 4, 35, 63
Estrogen Status Estrogen receptor status Positive, Negative
Progesterone Status Progesterone receptor status Positive, Negative
Regional Node Examined Number of regional lymph nodes examined 24, 14, 2
Reginol Node Positive Number of positive lymph nodes found 1, 5, 7
Survival Months Duration (in months) of patient survival after diagnosis 60, 62, 75
Status Survival status at last follow-up Alive, Dead
  • Target label: Status (Alive or Dead)

  • Missing values: None → data is complete and ready for analysis.


🧪 Use Case¶

This dataset can be used to train and evaluate machine learning models that predict patient survival based on demographic, clinical, and tumor-related characteristics. It’s well-suited for:

  • Binary classification (e.g., logistic regression, decision trees, ensemble methods)
  • Feature importance and selection
  • Survival analysis preparation
  • Exploratory data analysis and medical data preprocessing practice

Exploratory Data Analysis¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os
import plotly.offline as pyo

pio.renderers.default='notebook'
pyo.init_notebook_mode()
In [ ]:
# Data Loading
data = r"C:\Users\alros\OneDrive\TFM UCM\Breast_Cancer.csv"
df = pd.read_csv(data)
df.head()
Out[ ]:
Age Race Marital Status T Stage N Stage 6th Stage differentiate Grade A Stage Tumor Size Estrogen Status Progesterone Status Regional Node Examined Reginol Node Positive Survival Months Status
0 68 White Married T1 N1 IIA Poorly differentiated 3 Regional 4 Positive Positive 24 1 60 Alive
1 50 White Married T2 N2 IIIA Moderately differentiated 2 Regional 35 Positive Positive 14 5 62 Alive
2 58 White Divorced T3 N3 IIIC Moderately differentiated 2 Regional 63 Positive Positive 14 7 75 Alive
3 58 White Married T1 N1 IIA Poorly differentiated 3 Regional 18 Positive Positive 2 1 84 Alive
4 47 White Married T2 N1 IIB Poorly differentiated 3 Regional 41 Positive Positive 3 1 50 Alive
In [5]:
# Show DataFrame structure: column names, types, non-null counts, and memory usage
print(df.info())

# Show number of missing values in each column
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Age                     4024 non-null   int64   
 1   Race                    4024 non-null   category
 2   Marital_Status          4024 non-null   category
 3   T_Stage_                4024 non-null   category
 4   N_Stage                 4024 non-null   category
 5   6th_Stage               4024 non-null   category
 6   differentiate           4024 non-null   category
 7   Grade                   4024 non-null   category
 8   A_Stage                 4024 non-null   category
 9   Tumor_Size              4024 non-null   int64   
 10  Estrogen_Status         4024 non-null   category
 11  Progesterone_Status     4024 non-null   category
 12  Regional_Node_Examined  4024 non-null   int64   
 13  Reginol_Node_Positive   4024 non-null   int64   
 14  Survival_Months         4024 non-null   int64   
 15  Status                  4024 non-null   category
dtypes: category(11), int64(5)
memory usage: 202.3 KB
None
Out[5]:
Age                       0
Race                      0
Marital_Status            0
T_Stage_                  0
N_Stage                   0
6th_Stage                 0
differentiate             0
Grade                     0
A_Stage                   0
Tumor_Size                0
Estrogen_Status           0
Progesterone_Status       0
Regional_Node_Examined    0
Reginol_Node_Positive     0
Survival_Months           0
Status                    0
dtype: int64
In [ ]:
# Iterate over each column name in the DataFrame 'df'
for col in df.columns:
    # Check if the column name contains any spaces
    if ' ' in col:
        # If it does, replace spaces with underscores
        new_col = col.replace(' ', '_')
        # Rename the column in the DataFrame using the new name
        df.rename(columns={col: new_col}, inplace=True)

# Select the names of columns that are not numeric (i.e., likely categorical/text)
cat_cols = df.select_dtypes(exclude=np.number).columns

# Convert those columns to the 'category' data type to optimize memory and performance
df[cat_cols] = df[cat_cols].astype('category')
In [ ]:
# Generate summary statistics for numeric columns (count, mean, std, min, max, quartiles)
df.describe()
Out[ ]:
Age Tumor_Size Regional_Node_Examined Reginol_Node_Positive Survival_Months
count 4024.000000 4024.000000 4024.000000 4024.000000 4024.000000
mean 53.972167 30.473658 14.357107 4.158052 71.297962
std 8.963134 21.119696 8.099675 5.109331 22.921430
min 30.000000 1.000000 1.000000 1.000000 1.000000
25% 47.000000 16.000000 9.000000 1.000000 56.000000
50% 54.000000 25.000000 14.000000 2.000000 73.000000
75% 61.000000 38.000000 19.000000 5.000000 90.000000
max 69.000000 140.000000 61.000000 46.000000 107.000000
In [6]:
# Calculate the proportion of each unique value in the 'Status' column
# (e.g., what % are 'Alive' vs 'Dead') and reset the index to make it a DataFrame
status_counts = df.Status.value_counts(normalize=True).reset_index()

# Rename the columns for clarity: one for status category, one for proportion
status_counts.columns = ['Status', 'Proportion']

# Define custom colors for the pie chart
color_map = {
    'Dead': '#800020',     # Burgundy color for 'Dead'
    'Alive': '#228B22'     # Forest green for 'Alive'
}

# Create a pie chart using Plotly Express
status_dist = px.pie(
    status_counts,           # Data source
    names='Status',          # Labels in the pie slices
    values='Proportion',     # Values that determine the size of each slice
    color='Status',          # Use 'Status' to apply the custom color map
    color_discrete_map=color_map,  # Apply the custom colors
    title='Distribution of Status',  # Title of the chart
    hole=0                   # No hole → this is a full pie chart (not a donut)
)

# Show percentage and label inside each pie slice
status_dist.update_traces(textposition='inside', textinfo='percent+label')

# Display the pie chart
status_dist.show()
In [10]:
# Extract numeric and categorical columns from the DataFrame 

num_cols = df.select_dtypes(include=np.number).columns.tolist()
cat_cols_lst = cat_cols.tolist()
cat_cols_lst.remove('Status')
In [13]:
# Compute the correlation matrix for the numeric columns
corr_matrix = df[num_cols].corr()

# Create a heatmap figure using Plotly Graph Objects
corr_matrix = go.Figure(
    data=go.Heatmap(
        z=corr_matrix.values,         # Correlation values (2D array)
        x=corr_matrix.columns,        # Labels for the x-axis (column names)
        y=corr_matrix.columns,        # Labels for the y-axis (row names)
        colorscale=[                  # Custom color scale from negative to positive correlation
            [0.0, '#800020'],         # Strong negative correlation → burgundy
            [0.5, 'white'],           # Neutral correlation (0) → white
            [1.0, '#228B22']          # Strong positive correlation → green
        ],
        zmin=-1,                      # Minimum value on the color scale
        zmax=1,                       # Maximum value on the color scale
        colorbar=dict(title='Correlation')  # Title for the color scale legend
    )
)

# Customize layout: title, axis labels, and chart size
corr_matrix.update_layout(
    title='Correlation Heatmap',     # Main title of the plot
    xaxis_title='Features',          # X-axis label
    yaxis_title='Features',          # Y-axis label
    width=700,                       # Width of the figure
    height=700                       # Height of the figure
)

# Display the heatmap
corr_matrix.show()
In [14]:
# Create a grid of subplots (2 rows × 3 columns) to display multiple histograms
# Each subplot will have a title like "Histogram of <column name>"
hist = make_subplots(
    rows=2, cols=3,
    subplot_titles=[f'Histograma de {col}' for col in num_cols]
)

# Loop through each numeric column to create individual histograms
for i, col in enumerate(num_cols):
    # Create a temporary histogram figure using Plotly Express
    temp_fig = px.histogram(
        df,
        x=col,                           # Column to plot on x-axis
        nbins=30,                        # Number of bins in the histogram
        color_discrete_sequence=['#228B22']  # Use green color for bars
    )

    # Extract the actual plot trace (data and style) from the temp figure
    trace = temp_fig.data[0]

    # Determine the row and column position of the subplot
    row = i // 3 + 1                    # Integer division to get row (starts at 1)
    col_idx = i % 3 + 1                 # Modulo to get column (starts at 1)

    # Add the trace to the corresponding subplot in the grid
    hist.add_trace(trace, row=row, col=col_idx)

# Update the layout of the full figure
hist.update_layout(
    plot_bgcolor='white',              # White background for plot area
    paper_bgcolor='white',             # White background for entire figure
    font=dict(color='black'),          # Use black font color
    height=700,                        # Set the height of the figure (adjustable)
    showlegend=False                   # Hide the legend (not needed for simple histograms)
)

# Display the final figure with all histograms
hist.show()
In [15]:
# Create a subplot grid with 2 rows and 3 columns for box plots
# Each subplot gets a title like "Box Plot for <column name>"
bw = make_subplots(
    rows=2, cols=3,
    subplot_titles=[f'Box Plot for {col}' for col in num_cols]
)

# Loop through each numeric column to create individual box plots
for i, col in enumerate(num_cols):
    # Create a temporary box plot for the current column
    # 'points="outliers"' shows individual outlier points on the plot
    temp_fig = px.box(df, y=col, points='outliers')

    # Extract the plot trace (the box plot data)
    trace = temp_fig.data[0]

    # Customize trace colors: box outline and outlier markers
    trace.marker.color = 'black'      # Outlier points in black
    trace.line.color = '#228B22'      # Box outline in green

    # Calculate the subplot row and column positions (1-based)
    row = i // 3 + 1
    col_idx = i % 3 + 1

    # Add this trace to the corresponding subplot in the grid
    bw.add_trace(trace, row=row, col=col_idx)

# Update the overall layout with background colors, font, size, and no legend
bw.update_layout(
    plot_bgcolor='white',    # White plot area background
    paper_bgcolor='white',   # White full figure background
    font=dict(color='black'),# Black font color
    height=700,              # Set figure height (adjust if needed)
    showlegend=False         # Hide the legend (not necessary here)
)

# Display the box plots figure
bw.show()
In [117]:
# Define colors: start with two custom colors, then add a predefined qualitative palette from Plotly
colors = ['#228B22', '#8B0000'] + px.colors.qualitative.Dark24

# Set number of subplot rows and columns for pie charts layout
rows = 5
cols = 2

# Create a subplot figure with specified rows and columns, each subplot is a 'domain' type (for pie charts)
# The subplot titles correspond to the categorical column names (limited to rows * cols)
dist_cat = make_subplots(
    rows=rows,
    cols=cols,
    specs=[[{'type': 'domain'}]*cols]*rows,  # Each subplot is a pie chart (domain type)
    subplot_titles=[f'Distribución de {col}' for col in cat_cols_lst[:rows*cols]]  # Titles for each subplot
)

# Keep track of which legend items have been added to avoid duplicates
legend_items_added = set()

# Loop through each categorical column to plot
for i, col in enumerate(cat_cols_lst[:rows * cols]):
    row = i // cols + 1          # Calculate current subplot row (1-indexed)
    col_num = i % cols + 1       # Calculate current subplot column (1-indexed)

    # Calculate counts of each category in the current column
    counts = df[col].value_counts().reset_index()
    counts.columns = [col, 'count']

    # Create labels for the pie chart with the format "column: category_value"
    prefixed_labels = [f"{col}: {val}" for val in counts[col]]

    # Select colors for the categories (slice of the colors list)
    unique_values = counts[col].unique()
    color_selection = colors[:len(unique_values)]

    # Determine which legend items to show to avoid repeated labels in legends
    show_legend_flags = [label not in legend_items_added for label in prefixed_labels]
    legend_items_added.update(prefixed_labels)

    # Add a pie chart trace for the current categorical variable to the appropriate subplot position
    dist_cat.add_trace(
        go.Pie(
            labels=prefixed_labels,      # Labels with prefixed column name
            values=counts['count'],      # Counts for each category
            marker=dict(colors=color_selection),  # Colors for pie slices
            name=col,                   # Trace name for legend grouping
            hole=0.3,                   # Donut chart hole size
            legendgroup=col,            # Group legend entries by column
            showlegend=True             # Show legend (duplicates will be avoided by logic above)
        ),
        row=row,
        col=col_num
    )

# Update overall layout of the figure: size, title, and legend visibility
dist_cat.update_layout(
    height=300 * rows,  # Height scaled by number of rows to avoid overlap
    width=1000,         # Fixed width
    title_text="Distribución de Variables Categóricas (Gráficos de Pastel)",  # Main title
    showlegend=True     # Show legend in the figure
)

# Display the interactive pie chart subplots
dist_cat.show()
In [16]:
# Define number of rows and columns for the subplot grid
rows = 5
cols = 2

# Create subplots for bar charts of categorical variables
# Titles are like "Distribución de <column name>" for each categorical column
bar = make_subplots(
    rows=rows,
    cols=cols,
    subplot_titles=[f"Distribución de {col}" for col in cat_cols_lst[:rows * cols]]
)

# Loop through the first (rows * cols) categorical columns to plot their distributions
for i, col in enumerate(cat_cols_lst[:rows * cols]):
    # Calculate the subplot position (1-based)
    row = i // cols + 1
    col_num = i % cols + 1

    # Count the frequency of each unique category value in the column
    counts = df[col].value_counts().reset_index()
    counts.columns = [col, 'Frecuencia']  # Rename columns for clarity

    # Add a bar chart trace to the subplot for this categorical variable
    bar.add_trace(
        go.Bar(
            x=counts[col],           # Category names on x-axis
            y=counts['Frecuencia'],  # Frequency counts on y-axis
            marker_color='#228B22',  # Green bars
            name=col                 # Trace name (not shown because legend is off)
        ),
        row=row,
        col=col_num
    )

# Update layout settings for the whole figure
bar.update_layout(
    height=300 * rows,               # Height adjusted by number of rows
    width=1000,                     # Fixed width for the figure
    title_text="Distribución de Variables Categóricas",  # Figure title
    plot_bgcolor='white',           # Plot area background color
    paper_bgcolor='white',          # Full figure background color
    font=dict(color='black'),       # Black font color
    showlegend=False                # Hide the legend for cleaner appearance
)

# Show the bar charts figure
bar.show()
In [19]:
# Calculate the skewness of each numeric column in the DataFrame
# Skewness measures the asymmetry of the data distribution:
#   - Positive skew means a longer tail on the right
#   - Negative skew means a longer tail on the left
#   - Skew close to zero indicates a symmetric distribution
print(df[num_cols].skew())
Age                      -0.220373
Tumor_Size                1.739750
Regional_Node_Examined    0.829274
Reginol_Node_Positive     2.702536
Survival_Months          -0.589998
dtype: float64

Winsorizing Outliers¶

In [23]:
from scipy.stats.mstats import winsorize

def winsorize_columns(df, columns, limits=(0.01, 0.01)):
    """
    Applies winsorization to the specified numeric columns of a DataFrame.

    Parameters:
    df: pd.DataFrame - The original DataFrame.
    columns: list - List of numeric column names to winsorize.
    limits: tuple - Lower and upper limits for winsorization as fractions (default 1%).

    Returns:
    pd.DataFrame - A new DataFrame with winsorized columns.
    """
    # Create a copy of the original DataFrame to avoid modifying it directly
    df_winsorized = df.copy()
    
    # Iterate over each column specified
    for col in columns:
        # Check if the column's data type is numeric
        if pd.api.types.is_numeric_dtype(df[col]):
            # Apply winsorization to limit extreme values at both ends of the distribution
            # limits=(0.01, 0.01) means cap values below the 1st percentile and above the 99th percentile
            df_winsorized[col] = winsorize(df[col], limits=limits)
        else:
            # Inform if a column was skipped because it is not numeric
            print(f"Column '{col}' is not numeric and was skipped.")
    
    # Return the modified DataFrame with winsorized columns
    return df_winsorized

# Prepare the list of numeric columns to winsorize, excluding 'Survival_Months'
wins = [col for col in num_cols if col != 'Survival_Months']

# Create a new DataFrame with the winsorized columns applied
df_winsorized = winsorize_columns(df, wins)
In [25]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline

# Custom transformer class for winsorization compatible with scikit-learn pipelines
class Winsorizer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, limits=(0.01, 0.01)):
        # columns: list of columns to winsorize; if None, all columns are processed
        # limits: tuple specifying lower and upper winsorization limits as fractions
        self.columns = columns
        self.limits = limits

    def fit(self, X, y=None):
        # No fitting necessary for winsorization, so just return self
        return self

    def transform(self, X):
        # Make a copy of the input DataFrame to avoid modifying the original
        X = X.copy()
        # Determine which columns to process: specified or all columns
        cols = self.columns if self.columns else X.columns
        for col in cols:
            # Apply winsorization only on numeric columns
            if pd.api.types.is_numeric_dtype(X[col]):
                X[col] = winsorize(X[col], limits=self.limits)
        return X

# Create a scikit-learn pipeline that includes the Winsorizer transformer
# This pipeline can be easily integrated with other preprocessing or modeling steps
winsor_pipeline = make_pipeline(
    Winsorizer(columns=wins, limits=(0.01, 0.01))
)
In [26]:
selected_cols = num_cols[:6]

bw2 = make_subplots(
    rows=2, cols=3,
    subplot_titles=[f"Box Plot for {col}" for col in selected_cols]
)

for i, col in enumerate(selected_cols):
    row = i // 3 + 1
    col_num = i % 3 + 1

    bw2.add_trace(
        go.Box(
            y=df_winsorized[col],
            name=col,
            boxpoints='outliers',
            marker_color='black',
            line_color='#228B22'
        ),
        row=row, col=col_num
    )

bw2.update_layout(
    height=700,
    width=1000,
    title_text="Box Plots for Winsorized Numeric Variables",
    plot_bgcolor='white',
    paper_bgcolor='white',
    font=dict(color='black'),
    showlegend=False
)

bw2.show()
In [27]:
# Select the first 6 numeric columns for plotting histograms
selected_cols = num_cols[:6]

# Create a subplot grid with 2 rows and 3 columns, with titles for each histogram
hist2 = make_subplots(
    rows=2, cols=3,
    subplot_titles=[f"Histograma de {col}" for col in selected_cols]
)

# Loop over each selected column to add its histogram to the subplot
for i, col in enumerate(selected_cols):
    # Determine the subplot row and column position (1-based)
    row = i // 3 + 1
    col_num = i % 3 + 1

    # Add histogram trace for the winsorized column data
    hist2.add_trace(
        go.Histogram(
            x=df_winsorized[col],     # Data from winsorized DataFrame
            nbinsx=30,                # Number of bins in the histogram
            marker_color='#228B22',   # Green color for the bars
            name=col                  # Trace name (not shown since legend is off)
        ),
        row=row, col=col_num
    )

# Update overall figure layout: size, title, background, font, no legend
hist2.update_layout(
    height=700,
    width=1000,
    title_text="Histogramas de Variables Winsorizadas",  # Title in Spanish
    plot_bgcolor='white',
    paper_bgcolor='white',
    font=dict(color='black'),
    showlegend=False
)

# Display the figure with all histograms
hist2.show()

Codifying Categorical Variables¶

In [33]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler,LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
In [28]:
# Ordinal categorical variables with a defined order (to be encoded with OrdinalEncoder)
ordinal_features = [
    'T_Stage_',
    'N_Stage',
    '6th_Stage',
    'Grade',
    'differentiate'
]

# Nominal categorical variables without inherent order (to be encoded with OneHotEncoder)
nominal_features = [
    'Race',
    'Marital_Status',
    'Estrogen_Status',
    'Progesterone_Status',
    'A_Stage'
]

# Target variable (binary classification)
target = 'Status'
In [ ]:
# For each ordinal feature, print its unique values
# This helps to understand the categories so you can manually define the order
# needed later when applying OrdinalEncoder
for col in ordinal_features:
    print(f"Unique values for '{col}': {df[col].unique()}")
Unique values for 'T_Stage_': ['T1', 'T2', 'T3', 'T4']
Categories (4, object): ['T1', 'T2', 'T3', 'T4']
Unique values for 'N_Stage': ['N1', 'N2', 'N3']
Categories (3, object): ['N1', 'N2', 'N3']
Unique values for '6th_Stage': ['IIA', 'IIIA', 'IIIC', 'IIB', 'IIIB']
Categories (5, object): ['IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC']
Unique values for 'Grade': ['3', '2', '1', ' anaplastic; Grade IV']
Categories (4, object): [' anaplastic; Grade IV', '1', '2', '3']
Unique values for 'differentiate': ['Poorly differentiated', 'Moderately differentiated', 'Well differentiated', 'Undifferentiated']
Categories (4, object): ['Moderately differentiated', 'Poorly differentiated', 'Undifferentiated', 'Well differentiated']
In [34]:
# Define the correct order of categories for each ordinal variable
# This order will be used by the OrdinalEncoder to map categories to numbers consistently
ordinal_categories = [
    ['T1', 'T2', 'T3', 'T4'],                    # Order for T_Stage
    ['N1', 'N2', 'N3'],                          # Order for N_Stage
    ['IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC'],     # Order for 6th_Stage
    ['1', '2', '3', ' anaplastic; Grade IV'],   # Order for Grade
    ['Well differentiated', 'Moderately differentiated', 'Poorly differentiated', 'Undifferentiated']  # Order for differentiate
]

# Create an OrdinalEncoder instance with the specified category orders
ordinal_encoder = OrdinalEncoder(categories=ordinal_categories)
In [35]:
# Create a preprocessing pipeline for features only:
# - OneHotEncode nominal categorical columns
# - OrdinalEncode ordinal categorical columns
# - Pass through other columns unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('one_hot', OneHotEncoder(drop='first'), nominal_features),
        ('ordinal', ordinal_encoder, ordinal_features),
    ],
    remainder='passthrough'
)

# Apply the preprocessing to the features in the winsorized DataFrame
df_processed = preprocessor.fit_transform(df_winsorized)

# Convert the processed numpy array back to a DataFrame with feature names
X = pd.DataFrame(df_processed, columns=preprocessor.get_feature_names_out())

# Remove the target column ('Status') from features since it's included as passthrough
X.drop(columns=['remainder__Status'], inplace=True)

# --------------------------
# Handle target variable 'y' separately because:
# - The ColumnTransformer only transforms features, not the target variable.
# - LabelEncoder expects a 1D array/Series to encode labels (e.g., 'Alive', 'Dead') into numbers.
# - Including y inside ColumnTransformer would treat it as a feature, which is incorrect.
# Thus, we encode 'y' separately using LabelEncoder.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df[target])

# Convert encoded target labels back into a DataFrame for consistency
y = pd.DataFrame(y, columns=[target])

Variable Selection¶

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import  BernoulliNB
In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
Shape of X_train: (3219, 19)
Shape of X_test: (805, 19)
Shape of y_train: (3219, 1)
Shape of y_test: (805, 1)
In [45]:
import warnings
warnings.filterwarnings("ignore")  # Suppress warnings for cleaner output

# Define the machine learning models to evaluate
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),  # Logistic Regression with max 1000 iterations
    'RandomForest': RandomForestClassifier(),                 # Random Forest classifier with default params
    'SVC': SVC(kernel='linear'),                              # Support Vector Classifier with linear kernel
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'),  # XGBoost classifier with eval metric set                                      # Naive Bayes classifier for binary features
}

# Prepare dictionary to store cross-validation results for each model
results = {name: [] for name in models.keys()}

# Initialize variables to track the best model and feature selection
best_model_name = None
best_score = 0
best_rfe = None
best_model = None
best_num_features = 0

# Get list of feature names from training set (used later for reference)
feature_names = X_train.columns.tolist()

# Setup stratified 5-fold cross-validation with shuffle and fixed random seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Loop through each model to perform Recursive Feature Elimination (RFE) with varying feature counts
for name, model in models.items():
    # Try selecting from 1 up to all features
    for n_features in range(1, X_train.shape[1] + 1):
        # Initialize RFE with current model and number of features to select
        rfe = RFE(estimator=model, n_features_to_select=n_features)
        
        # Fit RFE on training data and transform features
        X_rfe = rfe.fit_transform(X_train, y_train)
        
        # Evaluate model with cross-validation on the selected features using F1 score
        scores = cross_val_score(model, X_rfe, y_train, cv=cv, scoring='f1')
        
        # Calculate mean F1 score across folds
        score_mean = np.mean(scores)
        
        # Store the mean score in results for this model
        results[name].append(score_mean)

        # Update best model and feature count if this is the highest score so far
        if score_mean > best_score:
            best_score = score_mean
            best_model_name = name
            best_model = model
            best_rfe = rfe
            best_num_features = n_features
In [50]:
# Print the name of the best-performing model based on cross-validated F1 score
print(f"Best model: {best_model_name}")

# Print the best average F1 score achieved during cross-validation (rounded to 4 decimals)
print(f"Best average F1-score (CV): {best_score:.4f}")

# Print the optimal number of features selected by RFE for the best model
print(f"Optimal number of features: {best_num_features}")

# Extract the names of features selected by the best RFE object
selected_features = [feature for feature, selected in zip(feature_names, best_rfe.support_) if selected]

# Print the list of features selected by the best model's RFE
print(f"Features selected by {best_model_name}:")
print(selected_features)
Best model: RandomForest
Best average F1-score (CV): 0.6349
Optimal number of features: 11
Features selected by RandomForest:
['one_hot__Marital_Status_Married', 'one_hot__Progesterone_Status_Positive', 'ordinal__T_Stage_', 'ordinal__N_Stage', 'ordinal__6th_Stage', 'ordinal__differentiate', 'remainder__Age', 'remainder__Tumor_Size', 'remainder__Regional_Node_Examined', 'remainder__Reginol_Node_Positive', 'remainder__Survival_Months']
In [46]:
# Create a Plotly figure to visualize model performance across different numbers of selected features
rfe_models = go.Figure()

# For each model, plot the F1-score progression as the number of selected features increases
for name, scores in results.items():
    rfe_models.add_trace(go.Scatter(
        x=list(range(1, X_train.shape[1] + 1)),  # Number of features selected from 1 up to total
        y=scores,                               # Corresponding mean F1 scores from cross-validation
        mode='lines+markers',                   # Plot with lines and markers
        name=name                              # Label each trace with the model's name
    ))

# Configure the layout of the plot
rfe_models.update_layout(
    title="Evolution of F1-score by Number of Selected Features",
    xaxis_title="Number of Selected Features",
    yaxis_title="Average F1-score (CV)",
    legend_title="Model",
    template="plotly_white",
    hovermode="x unified"  # Shows all traces' y-values for a given x when hovering
)

# Show the interactive plot
rfe_models.show()
In [51]:
class RFESelector(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, n_features_to_select=5):
        """
        Custom transformer that applies Recursive Feature Elimination (RFE) 
        using the provided estimator.

        Parameters:
        - estimator: a fitted model with coef_ or feature_importances_ attribute
        - n_features_to_select: number of top features to select
        """
        self.estimator = estimator
        self.n_features_to_select = n_features_to_select
        self.rfe = None      # Will hold the internal RFE object after fitting
        self.support_ = None # Boolean mask indicating which features are selected

    def fit(self, X, y=None):
        """
        Fit RFE on the data.

        Parameters:
        - X: feature matrix (DataFrame or ndarray)
        - y: target vector

        Returns:
        - self
        """
        self.rfe = RFE(self.estimator, n_features_to_select=self.n_features_to_select)
        self.rfe.fit(X, y)               # Fit RFE to data
        self.support_ = self.rfe.support_  # Store boolean mask of selected features
        return self

    def transform(self, X):
        """
        Transform X by selecting the features chosen by RFE.

        Handles both numpy arrays and pandas DataFrames:
        - If X is ndarray: select columns by boolean mask using numpy slicing.
        - If X is DataFrame: select columns by mask with DataFrame indexing.

        Parameters:
        - X: feature matrix to transform

        Returns:
        - Transformed feature matrix with only selected features
        """
        if hasattr(X, 'iloc') is False:
            # X is numpy array
            return X[:, self.support_]
        else:
            # X is pandas DataFrame
            return X.loc[:, X.columns[self.support_]]

Resampling using SMOTE, SMOTEEN, BORDERLINE-SMOTE, & ADASYN¶

In [52]:
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.combine import SMOTEENN

# Apply SMOTE (Synthetic Minority Over-sampling Technique) to balance classes by generating synthetic samples
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train[selected_features], y_train)

# Apply SMOTE combined with Edited Nearest Neighbors cleaning method to balance and clean the dataset
smoteenn = SMOTEENN(random_state=42)
X_smoteenn, y_smoteenn = smoteenn.fit_resample(X_train[selected_features], y_train)

# Apply Borderline-SMOTE, a variation of SMOTE focusing on samples near the decision boundary
borderline = BorderlineSMOTE(random_state=42)
X_borderline, y_borderline = borderline.fit_resample(X_train[selected_features], y_train)

# Apply ADASYN (Adaptive Synthetic Sampling), which adaptively generates synthetic samples for minority class
adasyn = ADASYN(random_state=42)
X_adasyn, y_adasyn = adasyn.fit_resample(X_train[selected_features], y_train)
In [53]:
def create_pca_df(X, y, n_components=3):
    """
    Perform PCA on features X, reduce to n_components dimensions,
    and return a DataFrame with principal components plus the class labels.
    
    Parameters:
    - X: feature matrix
    - y: target vector or series
    - n_components: number of PCA components to keep (default 3)
    
    Returns:
    - df_pca: DataFrame containing the principal components and class label
    - explained_variance_ratio_: array of explained variance ratios for each component
    """
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)  # Apply PCA to features
    
    # Create column names like 'PC1', 'PC2', ...
    cols = [f'PC{i+1}' for i in range(n_components)]
    
    # Build a DataFrame with PCA results
    df_pca = pd.DataFrame(X_pca, columns=cols)
    
    # Add class labels (target) as a column, reset index to align properly
    df_pca['Clase'] = y.squeeze().reset_index(drop=True)
    
    return df_pca, pca.explained_variance_ratio_

# List of different oversampling results to compare
remuestreos = [
    ('SMOTEENN', X_smoteenn, y_smoteenn),
    ('SMOTE', X_smote, y_smote),
    ('Borderline-SMOTE', X_borderline, y_borderline),
    ('ADASYN', X_adasyn, y_adasyn),
]

# --- 2D PCA Plot for all oversampling methods side-by-side ---
fig_2d = make_subplots(
    rows=1,
    cols=len(remuestreos),
    subplot_titles=[name for name, _, _ in remuestreos]
)

# For each oversampling dataset, create a 2D PCA scatter plot and add it as a subplot
for i, (name, X_res, y_res) in enumerate(remuestreos, start=1):
    df_pca_2d, _ = create_pca_df(X_res, y_res, n_components=2)  # PCA with 2 components
    
    # Create scatter plot colored by class
    fig_temp = px.scatter(df_pca_2d, x='PC1', y='PC2', color='Clase')
    
    # Add traces (points) to the combined subplot figure
    for trace in fig_temp.data:
        fig_2d.add_trace(trace, row=1, col=i)

# Adjust layout for visibility and display the plot
fig_2d.update_layout(height=600, width=1600, title_text='PCA 2D for Different Oversampling Methods')
fig_2d.show()

# --- 3D PCA Plot for all oversampling methods side-by-side ---
fig_3d = make_subplots(
    rows=1,
    cols=len(remuestreos),
    specs=[[{'type': 'scatter3d'}] * len(remuestreos)],  # Specify 3D scatter type for each subplot
    subplot_titles=[name for name, _, _ in remuestreos]
)

# For each oversampling dataset, create a 3D PCA scatter plot, separating points by class
for i, (name, X_res, y_res) in enumerate(remuestreos, start=1):
    df_pca_3d, _ = create_pca_df(X_res, y_res, n_components=3)  # PCA with 3 components
    
    # Plot each class separately to distinguish in legend and color
    for clase in df_pca_3d['Clase'].unique():
        df_clase = df_pca_3d[df_pca_3d['Clase'] == clase]
        
        fig_3d.add_trace(
            go.Scatter3d(
                x=df_clase['PC1'],
                y=df_clase['PC2'],
                z=df_clase['PC3'],
                mode='markers',
                marker=dict(size=3),
                name=f'{name} - Clase {clase}',
                legendgroup=str(i),  # Group legend entries per subplot
                showlegend=bool(clase == df_pca_3d['Clase'].unique()[0])  # Show legend once per group
            ),
            row=1,
            col=i
        )

# Layout adjustments and display
fig_3d.update_layout(height=600, width=1600, title_text='PCA 3D for Different Oversampling Methods')
fig_3d.show()

Data Modeling¶

Random Forest Classifier¶

In [56]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV
In [54]:
# Create and train a Random Forest classifier using the SMOTEENN-resampled training data
rf_1 = RandomForestClassifier(random_state=42)
rf_1.fit(X_smoteenn, y_smoteenn)

# Predict target labels on the test set using the trained model
y_pred_rf_1 = rf_1.predict(X_test[selected_features])

# Inverse transform predicted labels from encoded integers back to original class names
y_pred_rf_1 = label_encoder.inverse_transform(y_pred_rf_1)

# Inverse transform true test labels to original class names for comparison
y_test_1 = label_encoder.inverse_transform(y_test)
In [57]:
# Print classification report and confusion matrix for the Random Forest model
print(classification_report(y_test_1, y_pred_rf_1),confusion_matrix(y_test_1, y_pred_rf_1))
              precision    recall  f1-score   support

       Alive       0.94      0.91      0.92       682
        Dead       0.57      0.66      0.61       123

    accuracy                           0.87       805
   macro avg       0.75      0.78      0.77       805
weighted avg       0.88      0.87      0.87       805
 [[620  62]
 [ 42  81]]
In [ ]:
# Define the grid of hyperparameters to try for Random Forest
param_grid = {
    'n_estimators': [50, 100, 150, 200, 300, 400, 500],           # Number of trees in the forest
    'max_depth': [None, 10, 20, 30, 40, 50, 60],                 # Maximum depth of each tree
    'min_samples_split': [2, 5, 10, 15, 20],                     # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4, 6, 8],                         # Minimum samples required at a leaf node
    'max_features': ['auto', 'sqrt', 'log2', None],               # Number of features considered for best split
    'bootstrap': [True, False],                                   # Whether bootstrap samples are used when building trees
    'criterion': ['gini', 'entropy']                             # Function to measure split quality
}

# Create a RandomizedSearchCV object to search the hyperparameter space efficiently
random_search_rf = RandomizedSearchCV(
    estimator=rf_1,               # Base model to tune
    param_distributions=param_grid,  # Hyperparameters to sample from
    n_iter=50,                   # Number of parameter settings sampled (out of all possible combinations)
    cv=5,                        # 5-fold cross-validation to estimate model performance
    scoring='f1',                # Metric to optimize: F1-score balances precision and recall
    verbose=2,                   # Print progress messages
    random_state=42,             # Seed for reproducibility
    n_jobs=-1                   # Use all CPU cores for parallel processing
)

# Run the hyperparameter search on the SMOTEENN balanced training data
random_search_rf.fit(X_smoteenn, y_smoteenn)

# Extract the best model found during search
best_model = random_search_rf.best_estimator_
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Mejores hiperparámetros: {'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 60, 'criterion': 'entropy', 'bootstrap': False}
In [59]:
rf_best = best_model
rf_best.fit(X_smoteenn, y_smoteenn)
y_pred_best = rf_best.predict(X_test[selected_features])
y_pred_best=label_encoder.inverse_transform(y_pred_best)
print(classification_report(y_test_1, y_pred_best),confusion_matrix(y_test_1, y_pred_best))
              precision    recall  f1-score   support

       Alive       0.94      0.92      0.93       682
        Dead       0.59      0.65      0.62       123

    accuracy                           0.88       805
   macro avg       0.76      0.78      0.77       805
weighted avg       0.88      0.88      0.88       805
 [[626  56]
 [ 43  80]]

XGBoost Classifier¶

In [60]:
from xgboost import XGBClassifier
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Define the distribution of hyperparameters for XGBoost to sample during randomized search
param_dist = {
    "n_estimators": randint(50, 200),         # Number of boosting rounds (trees)
    "max_depth": randint(3, 10),               # Maximum depth of each tree
    "learning_rate": uniform(0.01, 0.2),       # Step size shrinkage used in update to prevent overfitting
    "subsample": uniform(0.6, 0.4),            # Fraction of samples to be used for fitting individual trees
    "colsample_bytree": uniform(0.6, 0.4),     # Fraction of features to be used for each tree
    "gamma": uniform(0, 2),                     # Minimum loss reduction required to make a further partition on a leaf node
    "reg_lambda": uniform(0, 2),                # L2 regularization term on weights (to reduce complexity)
    "reg_alpha": uniform(0, 2)                  # L1 regularization term on weights (to reduce complexity)
}

# Initialize an XGBoost classifier with parameters to avoid deprecated warnings
xgb_clf = XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Set up randomized search cross-validation to tune hyperparameters for XGBoost
random_search_xgb = RandomizedSearchCV(
    estimator=xgb_clf,          # The base XGBoost model
    param_distributions=param_dist,  # Hyperparameters to sample from
    n_iter=20,                 # Number of different hyperparameter sets to try
    scoring='f1',              # Optimize using F1-score (balance of precision and recall)
    cv=3,                      # Use 3-fold cross-validation
    verbose=1,                 # Print progress messages
    random_state=42,           # Seed for reproducibility
    n_jobs=-1                  # Use all CPU cores for parallel processing
)

# Prepare training data for XGBoost: convert to float type (required by XGBoost)
X_train_xgb = X_smoteenn.astype(float)
y_train_xgb = y_smoteenn.values.ravel().astype(float)

# Prepare test data similarly
X_test_xgb = X_test[selected_features].astype(float)

# Run the randomized search to find the best hyperparameters based on training data
random_search_xgb.fit(X_train_xgb, y_train_xgb)

# Print the best hyperparameters found by the search
print(f"Best Parameters: {random_search_xgb.best_params_}")

# Store the best XGBoost model for later use (e.g., prediction, evaluation)
best_model_xgb = random_search_xgb.best_estimator_
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Parameters: {'colsample_bytree': 0.6727299868828402, 'gamma': 0.36680901970686763, 'learning_rate': 0.07084844859190755, 'max_depth': 8, 'n_estimators': 138, 'reg_alpha': 0.5824582803960838, 'reg_lambda': 1.223705789444759, 'subsample': 0.6557975442608167}
In [61]:
xgb_clf_best = best_model_xgb
xgb_clf_best.fit(X_train_xgb, y_train_xgb)
y_pred_xgb = xgb_clf_best.predict(X_test_xgb)
y_pred_xgb=label_encoder.inverse_transform(y_pred_xgb)
y_test_1=label_encoder.inverse_transform(y_test)
print(classification_report(y_test_1, y_pred_best),confusion_matrix(y_test_1, y_pred_best))
              precision    recall  f1-score   support

       Alive       0.94      0.92      0.93       682
        Dead       0.59      0.65      0.62       123

    accuracy                           0.88       805
   macro avg       0.76      0.78      0.77       805
weighted avg       0.88      0.88      0.88       805
 [[626  56]
 [ 43  80]]

AdaBoost Classifier¶

In [ ]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
ada = AdaBoostClassifier(estimator=base_tree, random_state=42)
ada.fit(X_smoteenn, y_smoteenn)
y_pred_ada = ada.predict(X_test[selected_features])
y_pred_ada = label_encoder.inverse_transform(y_pred_ada)
print(classification_report(y_test_1, y_pred_ada), confusion_matrix(y_test_1, y_pred_ada))
              precision    recall  f1-score   support

       Alive       0.93      0.83      0.88       682
        Dead       0.42      0.66      0.51       123

    accuracy                           0.81       805
   macro avg       0.67      0.75      0.70       805
weighted avg       0.85      0.81      0.82       805
 [[569 113]
 [ 42  81]]
In [ ]:
# Define hyperparameter grid for tuning AdaBoostClassifier
param_grid_ada = {
    'n_estimators': [50, 100, 200],               # Number of boosting rounds
    'learning_rate': [0.01, 0.1, 1.0],            # Step size shrinkage to prevent overfitting
    'estimator__max_depth': [1, 2, 3],            # Max depth of each base decision tree
    'estimator__min_samples_split': [2, 5, 10]    # Minimum samples required to split a node in the base tree
}

# Set up GridSearchCV to search for the best hyperparameters
grid_search_ada = GridSearchCV(
    estimator=ada,               # The AdaBoostClassifier instance to tune
    param_grid=param_grid_ada,  # Hyperparameter grid to explore
    cv=5,                       # 5-fold cross-validation for reliable performance estimate
    scoring='f1',               # Optimize based on F1 score (harmonic mean of precision and recall)
    verbose=1,                  # Print progress messages during search
    n_jobs=-1                   # Use all available CPU cores for parallel processing
)

# Fit GridSearchCV on the training data to find the best parameters
grid_search_ada.fit(X_train, y_train)

# Output the best found parameters and the corresponding F1 score
print("Best Parameters:", grid_search_ada.best_params_)
print("Best F1 Score:", grid_search_ada.best_score_)

# Save the best AdaBoost model found during the grid search
best_ada = grid_search_ada.best_estimator_
Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best Parameters: {'estimator__max_depth': 1, 'estimator__min_samples_split': 2, 'learning_rate': 0.01, 'n_estimators': 50}
Best F1 Score: 0.6413433605722763
In [ ]:
best_ada.fit(X_smoteenn, y_smoteenn)
y_pred_ada_best = best_ada.predict(X_test[selected_features])
y_pred_ada_best = label_encoder.inverse_transform(y_pred_ada_best)
print(classification_report(y_test_1, y_pred_ada_best), confusion_matrix(y_test_1, y_pred_ada_best))
              precision    recall  f1-score   support

       Alive       0.93      0.90      0.91       682
        Dead       0.52      0.61      0.56       123

    accuracy                           0.85       805
   macro avg       0.72      0.75      0.74       805
weighted avg       0.87      0.85      0.86       805
 [[613  69]
 [ 48  75]]

Gradient Boosting Classifier¶

In [71]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_smoteenn, y_smoteenn)
y_pred_gbc = gbc.predict(X_test[selected_features])
y_pred_gbc = label_encoder.inverse_transform(y_pred_gbc)
print(classification_report(y_test_1, y_pred_gbc), confusion_matrix(y_test_1, y_pred_gbc))
              precision    recall  f1-score   support

       Alive       0.94      0.89      0.91       682
        Dead       0.52      0.67      0.58       123

    accuracy                           0.85       805
   macro avg       0.73      0.78      0.75       805
weighted avg       0.87      0.85      0.86       805
 [[605  77]
 [ 41  82]]
In [91]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Initialize the Gradient Boosting classifier
gbc = GradientBoostingClassifier(random_state=42)

# Define the distributions/ranges for hyperparameters to sample from
param_dist = {
    'n_estimators': randint(50, 300),          # Number of boosting stages (random integers from 50 to 299)
    'learning_rate': uniform(0.01, 0.2),       # Learning rate, floats between 0.01 and 0.21
    'max_depth': randint(3, 8),                 # Tree depth, integers from 3 to 7
    'min_samples_split': randint(2, 11),        # Minimum samples to split, integers 2 to 10
    'min_samples_leaf': randint(1, 5),          # Minimum samples at leaf, integers 1 to 4
    'subsample': uniform(0.6, 0.4)               # Subsample fraction, floats from 0.6 to 1.0
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=gbc,
    param_distributions=param_dist,
    n_iter=30,               # Number of parameter settings that are sampled
    scoring='f1',            # Use F1 score for evaluation
    cv=5,                   # 5-fold cross-validation
    verbose=2,              # Prints progress messages
    random_state=42,
    n_jobs=-1               # Use all available CPU cores
)

# Fit RandomizedSearchCV to your training data (replace X_train, y_train with your variables)
random_search.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding F1 score
print("Best Parameters:", random_search.best_params_)
print("Best F1 Score:", random_search.best_score_)

# Retrieve the best model found
best_gbc = random_search.best_estimator_
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Parameters: {'learning_rate': 0.02297844942179631, 'max_depth': 4, 'min_samples_leaf': 4, 'min_samples_split': 8, 'n_estimators': 179, 'subsample': 0.8849082359697769}
Best F1 Score: 0.6310373313738504
In [92]:
best_gbc.fit(X_smoteenn, y_smoteenn)
y_pred_gbc_best = best_gbc.predict(X_test[selected_features])
y_pred_gbc_best = label_encoder.inverse_transform(y_pred_gbc_best)
print(classification_report(y_test_1, y_pred_gbc_best), confusion_matrix(y_test_1, y_pred_gbc_best))
              precision    recall  f1-score   support

       Alive       0.94      0.87      0.90       682
        Dead       0.49      0.68      0.57       123

    accuracy                           0.84       805
   macro avg       0.71      0.78      0.74       805
weighted avg       0.87      0.84      0.85       805
 [[595  87]
 [ 39  84]]

Final Model: Stacking Classifier¶

In [119]:
from sklearn.ensemble import StackingClassifier

# Define the base learners for the stacking ensemble
# Each tuple contains a name and a pre-trained/fitted model
base_learners = [
    ('rf', rf_best),        # Best Random Forest model
    ('xgb', xgb_clf_best) # Best XGBoost model
]

# Define the final estimator, which will combine base learners' predictions
lr = LogisticRegression(random_state=42)  # Logistic Regression as the meta-classifier

# Create the stacking classifier
stack_model = StackingClassifier(
    estimators=base_learners,  # List of base models
    final_estimator=lr,        # Meta-model that learns how to combine base models
    passthrough=False,         # Only use predictions from base learners, not original features
    cv=5                       # Use 5-fold cross-validation to train the meta-model
)

# Fit the stacking model on the training data (features and labels)
stack_model.fit(X_train_xgb, y_train_xgb)

# Predict the labels for the test set using the stacking model
y_pred_stack = stack_model.predict(X_test_xgb)

# Convert predicted labels to integers if needed (for compatibility with label encoder)
y_pred_stack = y_pred_stack.astype(int)

# Inverse transform encoded labels back to original label names/classes
y_pred_stack = label_encoder.inverse_transform(y_pred_stack)

# Also convert the true test labels back to original class labels
y_test_1 = label_encoder.inverse_transform(y_test.astype(int))

# Print classification performance metrics (precision, recall, f1-score, support)
# and also show the confusion matrix for test predictions
print(classification_report(y_test_1, y_pred_stack), confusion_matrix(y_test_1, y_pred_stack))
              precision    recall  f1-score   support

       Alive       0.94      0.90      0.92       682
        Dead       0.54      0.67      0.60       123

    accuracy                           0.86       805
   macro avg       0.74      0.79      0.76       805
weighted avg       0.88      0.86      0.87       805
 [[612  70]
 [ 40  83]]
In [122]:
from sklearn.metrics import f1_score, precision_score, recall_score

y_proba_stack = stack_model.predict_proba(X_test_xgb)
probs_dead_stack = y_proba_stack[:, 1]

y_test_bin_stack = y_test

thresholds_stack = np.arange(0.1, 0.9, 0.05)
f1_scores_stack = []
recalls_stack = []
precisions_stack = []

for t in thresholds_stack:
    preds_stack = (probs_dead_stack >= t).astype(int)

    f1_stack = f1_score(y_test_bin_stack, preds_stack)
    recall_stack = recall_score(y_test_bin_stack, preds_stack)
    precision_stack = precision_score(y_test_bin_stack, preds_stack)

    f1_scores_stack.append(f1_stack)
    recalls_stack.append(recall_stack)
    precisions_stack.append(precision_stack)

best_threshold_stack = thresholds_stack[np.argmax(f1_scores_stack)]
best_f1_stack = max(f1_scores_stack)

fig_stack = go.Figure()

fig_stack.add_trace(go.Scatter(
    x=thresholds_stack, y=f1_scores_stack,
    mode='lines+markers',
    name='F1-score Stacked',
    line=dict(color='royalblue')
))

fig_stack.add_trace(go.Scatter(
    x=thresholds_stack, y=recalls_stack,
    mode='lines+markers',
    name='Recall Stacked',
    line=dict(color='green')
))

fig_stack.add_trace(go.Scatter(
    x=thresholds_stack, y=precisions_stack,
    mode='lines+markers',
    name='Precision Stacked',
    line=dict(color='firebrick')
))

fig_stack.add_trace(go.Scatter(
    x=[best_threshold_stack],
    y=[best_f1_stack],
    mode='markers+text',
    name='Mejor F1 Stack_model',
    marker=dict(color='red', size=10),
    text=[f'Mejor: {best_threshold_stack:.2f}'],
    textposition='top center'
))

fig_stack.update_layout(
    title='Métricas según el umbral de decisión (Modelo Stacked)',
    xaxis_title='Umbral de decisión',
    yaxis_title='Score',
    template='plotly_white',
    hovermode='closest'
)

fig_stack.show()
In [120]:
# Get predicted probabilities for the test set from the stacking model
y_proba_stack = stack_model.predict_proba(X_test_xgb)

# Extract probabilities of the positive class (usually class 1)
probs_dead = y_proba_stack[:, 1]

# Set the decision threshold manually to 0.8
# This was chosen because it gives the best balance between precision and recall
threshold = 0.8

# Convert probabilities to binary predictions based on the threshold
y_pred_thresh = (probs_dead >= threshold).astype(int)

# Convert encoded predictions back to original class labels
y_pred_final = label_encoder.inverse_transform(y_pred_thresh)

# Convert true labels back to original class labels
y_test_final = label_encoder.inverse_transform(y_test.astype(int))

# Print classification report (precision, recall, f1-score, support) for evaluation
print(classification_report(y_test_final, y_pred_final))

# Print confusion matrix to see TP, FP, FN, TN counts
print(confusion_matrix(y_test_final, y_pred_final))
              precision    recall  f1-score   support

       Alive       0.93      0.93      0.93       682
        Dead       0.62      0.62      0.62       123

    accuracy                           0.88       805
   macro avg       0.78      0.78      0.78       805
weighted avg       0.88      0.88      0.88       805

[[636  46]
 [ 47  76]]
In [ ]:
import shap  

# Get the Random Forest estimator from the stacking classifier
estimator_rf = stack_model.named_estimators_['rf']

# Create a SHAP explainer object for the Random Forest model using the training dataset
# This prepares SHAP to calculate the contribution of each feature to predictions
explainer_rf = shap.Explainer(estimator_rf, X_train_xgb)

# Compute SHAP values for all samples in the training set
# shap_values_rf will contain SHAP values for each feature, for each sample, and for each class (if classification)
shap_values_rf = explainer_rf(X_train_xgb)

# Plot a beeswarm plot of SHAP values for the class with index 1 (usually the positive class)
# The beeswarm plot shows the impact of each feature on the prediction across samples,
# where color represents feature value and position represents SHAP value (importance)
shap.plots.beeswarm(shap_values_rf[:, :, 1])
100%|===================| 8631/8640 [02:35<00:00]        
No description has been provided for this image
In [ ]:
# Extract the XGBoost classifier from the stacking model by its name 'xgb'
estimator_xgb = stack_model.named_estimators_['xgb']

# Create a SHAP explainer object for the XGBoost model using the training data
# This initializes the SHAP framework to compute feature contributions for this model
explainer_xgb = shap.Explainer(estimator_xgb, X_train_xgb)

# Calculate SHAP values for all samples in the training set
# These values quantify how each feature influences the prediction for each sample
shap_values_xgb = explainer_xgb(X_train_xgb)

# Generate a beeswarm plot to visualize the SHAP values across all features and samples
# The plot shows feature impact on the model output, where color indicates feature value
shap.plots.beeswarm(shap_values_xgb)
 96%|=================== | 4143/4320 [00:21<00:00]       
No description has been provided for this image
In [105]:
for name, model in stack_model.named_estimators_.items():
    if hasattr(model, 'feature_importances_'):
        print(f"Feature importance from {name}:")
        for feature, importance in zip(X_smoteenn.columns, model.feature_importances_):
            print(f"{feature}: {importance}")
        print("\n")
Feature importance from rf:
one_hot__Marital_Status_Married: 0.047137544697609245
one_hot__Progesterone_Status_Positive: 0.05723504347968043
ordinal__T_Stage_: 0.021759227511246197
ordinal__N_Stage: 0.09154596397316372
ordinal__6th_Stage: 0.06801860973600278
ordinal__differentiate: 0.081122974535775
remainder__Age: 0.05787916588402614
remainder__Tumor_Size: 0.04516788204086069
remainder__Regional_Node_Examined: 0.047407989791646225
remainder__Reginol_Node_Positive: 0.06837428706881273
remainder__Survival_Months: 0.4143513112811768


Feature importance from xgb:
one_hot__Marital_Status_Married: 0.08079218119382858
one_hot__Progesterone_Status_Positive: 0.09488501399755478
ordinal__T_Stage_: 0.03712789714336395
ordinal__N_Stage: 0.24990662932395935
ordinal__6th_Stage: 0.09759394079446793
ordinal__differentiate: 0.10152040421962738
remainder__Age: 0.02577250264585018
remainder__Tumor_Size: 0.02142142504453659
remainder__Regional_Node_Examined: 0.024313760921359062
remainder__Reginol_Node_Positive: 0.05845312774181366
remainder__Survival_Months: 0.20821313560009003


Feature importance from ada:
one_hot__Marital_Status_Married: 0.0
one_hot__Progesterone_Status_Positive: 0.0
ordinal__T_Stage_: 0.0
ordinal__N_Stage: 0.0
ordinal__6th_Stage: 0.0
ordinal__differentiate: 0.0
remainder__Age: 0.0
remainder__Tumor_Size: 0.0
remainder__Regional_Node_Examined: 0.0
remainder__Reginol_Node_Positive: 0.0
remainder__Survival_Months: 1.0


Feature importance from gbc:
one_hot__Marital_Status_Married: 0.027802606371882765
one_hot__Progesterone_Status_Positive: 0.009901675992614177
ordinal__T_Stage_: 0.020726237414516447
ordinal__N_Stage: 0.10188451631775784
ordinal__6th_Stage: 0.022300360131321792
ordinal__differentiate: 0.08710066509061073
remainder__Age: 0.02036515848769322
remainder__Tumor_Size: 0.0057951560928509935
remainder__Regional_Node_Examined: 0.014820447634640216
remainder__Reginol_Node_Positive: 0.04904147429981064
remainder__Survival_Months: 0.6402617021663012


In [123]:
# Initialize an empty DataFrame to collect feature importance data from each model
df_importances = pd.DataFrame()

# Loop through each base estimator in the stacking model
for name, model in stack_model.named_estimators_.items():
    # Check if the model has the attribute 'feature_importances_' (some models like tree-based have it)
    if hasattr(model, 'feature_importances_'):
        # Create a temporary DataFrame with feature names, their importance scores, and the model name
        temp_df = pd.DataFrame({
            'feature': X_smoteenn.columns,  # Feature names from the training data used
            'importance': model.feature_importances_,  # Corresponding importance values
            'model': name  # Name of the model (e.g., 'rf', 'xgb')
        })
        # Append this temporary DataFrame to the main DataFrame holding all models' importances
        df_importances = pd.concat([df_importances, temp_df], axis=0)

# Import plotly express for easy plotting
import plotly.express as px

# Create a horizontal grouped bar plot to visualize feature importances for each model
fig_fp = px.bar(
    df_importances,
    x='importance',  # Importance values on the x-axis
    y='feature',  # Feature names on the y-axis
    color='model',  # Differentiate bars by model with colors
    orientation='h',  # Horizontal bars
    title='Feature Importances by Model',  # Chart title
    height=600,  # Set the height of the figure for better readability
    color_discrete_map={  # Define specific colors for each model
        'rf': 'forestgreen',
        'xgb': 'darkred'
    }
)

# Adjust layout to group bars side-by-side for each feature, separated by model
fig_fp.update_layout(barmode='group')

# Display the interactive plot
fig_fp.show()

Pipeline con umbral Ajustado¶

In [124]:
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.base import BaseEstimator, ClassifierMixin

# Custom classifier wrapper that applies a decision threshold to predicted probabilities
class ThresholdClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, model, threshold=0.5):
        """
        Initialize with:
        - model: any sklearn-like classifier with predict_proba method
        - threshold: probability cutoff to assign class label 1
        """
        self.model = model
        self.threshold = threshold

    def fit(self, X, y):
        """
        Fit the wrapped model on training data.
        """
        self.model.fit(X, y)
        return self  # Return self to allow chaining

    def predict(self, X):
        """
        Predict class labels by applying the threshold on the probability
        of the positive class (class 1).
        """
        proba = self.model.predict_proba(X)[:, 1]  # Get probabilities for class 1
        return (proba >= self.threshold).astype(int)  # Apply threshold and return labels

    def predict_proba(self, X):
        """
        Return the predicted probabilities from the wrapped model.
        """
        return self.model.predict_proba(X)

    def score(self, X, y):
        """
        Use the wrapped model's score method (default accuracy).
        """
        return self.model.score(X, y)
In [125]:
# Define a pipeline using imblearn's Pipeline to handle imbalanced data and custom steps
pipeline_T = ImbPipeline(steps=[
    # Step 1: Apply Winsorization to numerical columns to limit extreme outliers
    ('winsor', Winsorizer(columns=num_cols, limits=(0.01, 0.01))),

    # Step 2: Preprocessing step, usually includes transformations like scaling, encoding, etc.
    ('preprocessing', preprocessor),

    # Step 3: Feature selection using Recursive Feature Elimination (RFE) with a Random Forest estimator
    # Selects the best_num_features most important features based on rf_1 model
    ('feature_selection', RFESelector(estimator=rf_1, n_features_to_select=best_num_features)),

    # Step 4: Resample data to handle class imbalance with SMOTEENN (combines oversampling and cleaning)
    ('resampling', smoteenn),

    # Step 5: Classification using the ThresholdClassifier wrapper around the stacked model
    # Applies a custom threshold of 0.8 for predicting positive class
    ('model', ThresholdClassifier(model=stack_model, threshold=0.8))
])

# Fit the entire pipeline on the training data (X_train2, y_train2)
pipeline_T.fit(X_train2, y_train2)

# Predict labels on the test data (X_test2)
y_pred_2 = pipeline_T.predict(X_test2)

# Convert predicted numeric labels back to original labels using the label encoder
y_pred_2 = label_encoder.inverse_transform(y_pred_2)

# Print classification metrics and confusion matrix comparing true vs predicted labels
print(classification_report(y_test2, y_pred_2), confusion_matrix(y_test2, y_pred_2))
              precision    recall  f1-score   support

       Alive       0.93      0.93      0.93       682
        Dead       0.61      0.61      0.61       123

    accuracy                           0.88       805
   macro avg       0.77      0.77      0.77       805
weighted avg       0.88      0.88      0.88       805
 [[635  47]
 [ 48  75]]
In [ ]:
import pickle

# Open (or create) a file named "pipelin_stack_VF.pkl" in write-binary mode
with open("pipelin_stack_VF.pkl", "wb") as f:
    # Serialize and save the pipeline_VF object into the file
    pickle.dump(pipeline_VF, f)